
Mining a corpus of biographical texts using keywords

Internal identifier: 000059 (Main/Exploration); previous: 000058; next: 000060

Source:

Literary and Linguistic Computing [ 0268-1145 ] ; 2010-04.

RBID : ISTEX:7C87E90B31174A63CC37AC19EBEA8261A9D411BB

Abstract

Using statistically derived keywords to characterize texts has become an important research method for digital humanists and corpus linguists in areas such as literary analysis and the exploration of genre difference. Keywords (and the associated concepts of keyness and key-keyness) have inspired conferences and workshops, many and varied research papers, and are central to several modern corpus processing tools. In this article, we present evidence that (at least for the task of biographical sentence classification) frequent words characterize texts better than keywords or key-keywords. Using the naïve Bayes learning algorithm in conjunction with frequency-, keyword-, and key-keyword-based text representation to classify a corpus of biographical sentences, we discovered that the use of frequent words alone provided a classification accuracy better than either the keyword or key-keyword representations at a statistically significant level. This result suggests that (for the biographical sentence classification task at least) frequent words characterize texts better than keywords derived using more computationally intensive methods.
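
To make the contrast in the abstract concrete, here is a minimal sketch of the two ways of choosing a vocabulary: the most frequent words of a corpus versus "keywords" ranked by a keyness statistic. The abstract does not say which keyness statistic the article uses, so the log-likelihood (G2) score below is an assumption, chosen only because it is a common choice in corpus linguistics. The function names and the toy sentences are hypothetical and do not come from the article.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """G2 keyness of a word seen `a` times in a study corpus of `c` tokens
    and `b` times in a reference corpus of `d` tokens (assumed statistic)."""
    e1 = c * (a + b) / (c + d)  # expected count in the study corpus
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def top_frequent(tokens, k):
    """Frequency-based representation: the k most frequent words."""
    return [word for word, _ in Counter(tokens).most_common(k)]


def top_keywords(study, reference, k):
    """Keyword-based representation: the k words most over-represented
    in the study corpus relative to the reference corpus."""
    s, r = Counter(study), Counter(reference)
    c, d = len(study), len(reference)
    scored = {
        w: log_likelihood(s[w], r[w], c, d)
        for w in s
        if s[w] / c > r[w] / d  # keep positively key words only
    }
    return sorted(scored, key=scored.get, reverse=True)[:k]


# Hypothetical toy data: one biographical and one non-biographical sentence.
study = "he was born in 1900 and he died in 1950".split()
reference = "the market rose and the price fell in the spring".split()

print(top_frequent(study, 3))
print(top_keywords(study, reference, 3))
```

Either word list could then serve as the feature vocabulary for a classifier such as naïve Bayes. The frequency list needs only one pass over the corpus, while the keyword list also needs a reference corpus and a score per word, which is the extra computation the abstract alludes to.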

Url:

https://api.istex.fr/document/7C87E90B31174A63CC37AC19EBEA8261A9D411BB/fulltext/pdf

DOI: 10.1093/llc/fqp035

Affiliations:

Japan

Links toward previous steps (curation, corpus...)

to stream Istex, to step Corpus: 000385
to stream Istex, to step Curation: 000385
to stream Istex, to step Checkpoint: 000030
to stream Main, to step Merge: 000059
to stream Main, to step Curation: 000059

The document in XML format

<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title>Mining a corpus of biographical texts using keywords</title>
<author wicri:is="90%"><name sortKey="Conway, Mike" sort="Conway, Mike" uniqKey="Conway M" first="Mike" last="Conway">Mike Conway</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:7C87E90B31174A63CC37AC19EBEA8261A9D411BB</idno>
<date when="2010" year="2010">2010</date>
<idno type="doi">10.1093/llc/fqp035</idno>
<idno type="url">https://api.istex.fr/document/7C87E90B31174A63CC37AC19EBEA8261A9D411BB/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000385</idno>
<idno type="wicri:Area/Istex/Curation">000385</idno>
<idno type="wicri:Area/Istex/Checkpoint">000030</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Checkpoint">000030</idno>
<idno type="wicri:doubleKey">0268-1145:2010:Conway M:mining:a:corpus</idno>
<idno type="wicri:Area/Main/Merge">000059</idno>
<idno type="wicri:Area/Main/Curation">000059</idno>
<idno type="wicri:Area/Main/Exploration">000059</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a">Mining a corpus of biographical texts using keywords</title>
<author wicri:is="90%"><name sortKey="Conway, Mike" sort="Conway, Mike" uniqKey="Conway M" first="Mike" last="Conway">Mike Conway</name>
<affiliation wicri:level="1"><country xml:lang="fr">Japon</country>
<wicri:regionArea>National Institute of Informatics</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Japon</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="j">Literary and Linguistic Computing</title>
<idno type="ISSN">0268-1145</idno>
<idno type="eISSN">1477-4615</idno>
<imprint><publisher>Oxford University Press</publisher>
<date type="published" when="2010-04">2010-04</date>
<biblScope unit="volume">25</biblScope>
<biblScope unit="issue">1</biblScope>
<biblScope unit="page" from="23">23</biblScope>
<biblScope unit="page" to="35">35</biblScope>
</imprint>
<idno type="ISSN">0268-1145</idno>
</series>
<idno type="istex">7C87E90B31174A63CC37AC19EBEA8261A9D411BB</idno>
<idno type="DOI">10.1093/llc/fqp035</idno>
<idno type="ArticleID">fqp035</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0268-1145</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract">Using statistically derived keywords to characterize texts has become an important research method for digital humanists and corpus linguists in areas such as literary analysis and the exploration of genre difference. Keywords (and the associated concepts of keyness and key-keyness) have inspired conferences and workshops, many and varied research papers, and are central to several modern corpus processing tools. In this article, we present evidence that (at least for the task of biographical sentence classification) frequent words characterize texts better than keywords or key-keywords. Using the naïve Bayes learning algorithm in conjunction with frequency-, keyword-, and key-keyword-based text representation to classify a corpus of biographical sentences, we discovered that the use of frequent words alone provided a classification accuracy better than either the keyword or key-keyword representations at a statistically significant level. This result suggests that (for the biographical sentence classification task at least) frequent words characterize texts better than keywords derived using more computationally intensive methods.</div>
</front>
</TEI>
<affiliations><list><country><li>Japon</li>
</country>
</list>
<tree><country name="Japon"><noRegion><name sortKey="Conway, Mike" sort="Conway, Mike" uniqKey="Conway M" first="Mike" last="Conway">Mike Conway</name>
</noRegion>
<name sortKey="Conway, Mike" sort="Conway, Mike" uniqKey="Conway M" first="Mike" last="Conway">Mike Conway</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Ticri/explor/TeiVM2/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000059 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000059 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Ticri
   |area=    TeiVM2
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:7C87E90B31174A63CC37AC19EBEA8261A9D411BB
   |texte=   Mining a corpus of biographical texts using keywords
}}

This area was generated with Dilib version V0.6.31.
Data generation: Mon Oct 30 21:59:18 2017. Site generation: Sun Feb 11 23:16:06 2024

TEI exploration server
Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.